Investigating Measures for Pairwise Document Similarity

نویسندگان

  • Jeffrey Isaacs
  • Javed Aslam
چکیده

The need for a more effective similarity measure is growing as a result of the astonishing amount of information being placed online. Most existing similarity measures are defined by empirically derived formulas and cannot easily be extended to new applications. We present a pairwise document similarity measure based on Information Theory, and present corpus dependent and independent applications of this measure. When ranked with existing similarity measures over TREC FBIS data, our corpus dependent information theoretic similarity measure ranked first.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity

Measuring the similarity between documents is an essential operation in text mining, especially document clustering. The traditional method of finding the similarity between documents has always been based on extracting individual words from the documents, and using heuristics to give weights to those features. Standard methods in data mining are then used to find the similarity between documen...

متن کامل

Similarity Measures in Documents Using Association Graphs

In this paper we present a new model, designated as Association Graph, to improve document representation, facilitating the ontological dimension. We explain how to generate and use this kind of graph. Also, we analyze different document similarity measures based on this representation. A classical vector space model was used to evaluate this model and measures, investigating their strengths an...

متن کامل

بررسی قابلیت بهکارگیری سنجه های مرکزیت به عنوان شاخصهای ارتباط استنادی مدارک در بازیابی اطلاعات رابطه ای: مطالعۀ مقدماتی

Purpose: this is a pilot study tends to investigate correlation between centrality measures with bibliographic coupling as a well-known citation-based document similarity measure.  Methodology: using citation analysis method, 40 research articles belonging to four engineering/pure disciplines (Physics, Chemistry, Biology, and computer) and four Humanities and Social disciplines (Economics, Edu...

متن کامل

Pairwise Document Similarity in Large Collections with MapReduce

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a colle...

متن کامل

A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

The task of plagiarism detection entails two main steps, suspicious candidate retrieval and pairwise document similarity analysis also called detailed analysis. In this paper we focus on the second subtask. We will report our monolingual plagiarism detection system which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passag...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999